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Abstract 

Next Generation Sequencing (NGS) technologies generate large amounts of short read data for 
many different organisms. The fact that NGS reads are generally short makes it challenging to 
assemble the reads and reconstruct the original genome sequence. For clustering genomes using 
such NGS data, word-count based alignment-free sequence comparison is a promising approach, but 
for this approach, the underlying expected word counts are essential. 

A plausible model for this underlying distribution of word counts is given through modelling the 
DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available 
to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do 
not provide a single long sequence, inference methods on Markovian properties of sequences based 
on single long sequences cannot be directly used for NGS short read data. 

Here we derive a normal approximation for such word counts. We also show that the traditional 
Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for 
physical mapping. We propose several methods to estimate the order of the MC based on NGS 
reads and evaluate them using simulations. We illustrate the applications of our results by clustering 
genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free 
sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect 
on the clustering results, and that the clustering results that use a MC of the estimated order give a 
plausible clustering of the species. 

Our implementation of the statistics developed here is available as R package “NGS.MC” at 

http://www-ref.usc.edu/-fsun/Programs/NGS-MC/NGS-MC.html 
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1 Introduction 

NGS technologies generate large amounts of overlapping short read data for many different organisms; 
for example a read is a subsequence of less than 400 bps for Illumina and 700 bps for 454 sequencing 
technologies, and can sometimes be much shorter. The fact that NGS reads are generally short makes it 
challenging to reconstruct the original genome sequence. 

*to whom correspondence should be addressed; fsun@usc.edu 
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Recently several word-count based alignment-free sequence comparison methods have been applied 
to infer the relationship among different species (Song et al., 2013] Yi and Jinj 2013| ) and metagenomic 
samples ( |Behnam and Smith] 2014] Hurwitz et al.| 2014] pTang et al.| 201 2\ [Wang et al. 2014) based 


on NGS reads without assembly. Our alignment-free sequence dissimilarity measures, d\ and d% (Song 


et al.[|2014]|2013| ), and their variants (Behna m et al.||20l3]|Liu et al.[|201 1 l|Ren et aL]|2013[ ) have shown 
promise. These methods require the knowledge about the approximate distribution of word counts in the 
underlying sequences. While a model which assumes that all letters in the sequence are equally likely 


is relatively straightforward to analyse, see Reinert et al. (2009), a Markov model for the underlying 
sequences is more realistic. 


Markov chains (MC) have been widely used to model molecular sequences (Almagor [1983) with 
many applications including the study of dependencies between the bases (jBlaisdell 1985), the enrich¬ 
ment and depletion of certain word patterns (Pevzner et al.| 1989| ), prediction of occurrences of long word 
patterns from short patterns ( Arnold et al.[ 1988; Hong, 1990), and the detection of signals in introns (Av¬ 
ery] 1987). Narlikar et al. ( 2013J ) studied the effect of the order of MCs on several biological problems 
including phylogenetic analysis, assignment of sequence fragments to different genomes in metagnomic 
studies, motif discovery, and functional classification of promoters. These applications showed the im¬ 
portance of accurate specification of the order of MCs. Reliable estimators for the order of the MC and 
the transition probability matrix based on the sequence data are crucial. 

Based on relatively long molecular sequences, for a general finite state MC sequence of letters from 


a finite alphabet A = {1, 2, • • • , L} of size L, Hoel (1954) showed, under the hypothesis that the long 


sequence follows a (k — 2)-th order MC, that twice the log-likelihood ratio of the probability of the 
sequence under a (k — l)-th order MC versus that under the (k — 2)-th order MC model follows approx¬ 
imately a ^-distribution with df k = (L — 1 ) 2 L k ~ 2 degrees of freedom under general conditions. He 
also approximated the log-likelihood ratio by the Pearson-type statistic 
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which is also approximately y-distributed with the same degrees of freedom. Here, w = w\W 2 • • • w k 

denotes a fc-word formed of letters Wi E A , w = W 2 ■ ■ ■ w k , w~ = w\W 2 ■ ■ ■ w k ~ i, and = 

N_ N 

W 2 ■ ■ ■ w k - 1 ; A w denotes the count of the word w in the sequence, and E w = w is the estimated 


expected count of w if the sequence is generated by a MC of order k — 2. Here k > 3; see also Avery and 
Henderson (19991 for a detailed study, Billingsley ( |1961a|b l for an an excellent exposition of statistical 
issues related to MCs, as well as |Ewens and Grant] ( [2005 ); Reinert et al. ( 2000 2005 1 ; Waterman] ( 1995 ) 
for applications to sequence analysis. 

The Chi-square statistic ([!]) and the log-likelihood ratio statistics can be used to test the order of a MC, 
using all k -words w E A k . When a particular order of MC is rejected, we can identify particular word 
patterns that are exceptional, through the approximate distribution of A w . The approximate distributions 


of A w in long sequences is well understood, see for example Reinert et al. (2000 2005); Waterman 


(1995). In particular, suppose that the sequence follows a stationary (k — 2)-th order MC and let 
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Theorem 6.4.2 in Reinert et al. (2005) gives that, as sequence length goes to infinity, for all real values 
x, P (Z w < x) —> Tjx). where <f> denotes the cumulative distribution function of a standard normal 
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variable. We also say that Z w converges to the standard normal distribution N( 0,1) in distribution. This 
asymptotic result can then be used to find exceptional words in long sequences. 

Given an NGS short read sample, it is tempting to use the test statistic S k defined in ([!]) to test the 
order of a MC by simply counting the number of the occurrences of words in short read data. However, 
as the short reads from NGS data are sampled randomly from the genome, some parts of the genome 
are possibly not sampled and some parts are possibly sampled extensively. The sampling process intro¬ 
duces additional randomness to the statistic, and makes S k deviate from its traditional x 2 -distribution. 
Similarly, the approximate distribution of Z w given in ([2]) will be different from the standard normal 
distribution. 

In this paper, we study these approximate distributions, both theoretically and by simulations. First 
we extend the statistics Sk and Z w for a MC sequence to Sj' ! and Z^ for the NGS read data. Our un¬ 
derlying model for the distribution of reads along the genome is the potentially inhomogeneous Lander- 
Waterman model for physical mapping (Lander and Waterman 1988). We discover that for a set of short 
reads sampled from a (k — 2)-th order MC sequence, the statistic S[f follows approximately a gamma 
distribution with shape parameter df k /2 and scale parameter 2d, where d is a factor related to the distri¬ 
bution of the reads along the genome. We also show that, with the same factor d, the distribution of the 
single word statistic Z^/sfd tends to the standard normal distribution. Based on the theoretical results, 
we introduce an estimator for the order of the MC using NGS data. For practical purposes, we also give 
an estimator for the factor d when the underlying reads sampling distribution is unknown. To the best of 
our knowledge, this is the first study of the Markovian properties of molecular sequences based on NGS 
read data. 

To illustrate our theoretical results and our estimators, we first carry out a simulation study based on 
transition probability matrices which are estimated from cis-regulatory module (CRM) DNA sequences, 
and insert repeats. We simulate different read lengths, numbers of reads, inhomogeneous sampling, as 
well as sequencing errors, and we include a regime where the sampling rate depends on the GC content. 
If the GC bias is not very strong or the sequencing depth is not very low, then the simulation results agree 
with our theoretical predictions despite the theoretical assumptions being slightly violated. 

Next we apply our methods to cluster 28 vertebrate species using our alignment-free dissimilarity 
measures d\ and df under different MC models which are estimated from NGS read samples. The 
estimated orders based on NGS data without assembly are found to be consistent with those inferred 
directly from the long genome sequences. The clustering performs best when using MCs around the 
estimated order. Applying the same analysis to 13 tropical free species whose genomes are unknown, 
based on their NGS read samples, the most plausible clustering is achieved when using a MC model of 
order close to the one estimated from the NGS reads. 

The paper is organized as follows. The “Methods” section contains the probabilistic models of 
generating the MC sequence and sampling the short reads, as well as the theorem for the approximate 
distributions of S k and Z^ for NGS data. This theorem is used to derive our estimators for the order 
of the MC and for the factor d. In the “Results” section, we first provide extensive simulation studies 
including the comparison of the theoretical approximate distributions and the simulated results for Sj} 
and Z^, the effect of inhomogeneous sampling and sequencing errors, the efficiency of the estimator 
of the factor d, and the evaluations of the methods for estimating the MC order. Second, we estimate 
the orders of the MCs for 28 vertebrate species based on the simulated whole genome NGS samples. 
We then use our dissimilarity measures dd 2 and df to cluster the NGS samples of the 28 species under 
different MC orders to see the effect on the performance of the clustering. The applications show that 
our new methods are effective for the inference of relationships among sequences based on NGS reads. 
Finally, we use our methods to study the relationships among 13 free species whose complete genomic 
sequences as well as their phylogenetic relationships are unknown. Our clustering results are consistent 
with the physical characteristics of the tree species. The paper concludes with some discussion of the 
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study. 


2 Methods 


2.1 Probabilistic modeling of a MC sequence and random sampling of the reads using 
NGS 

In NGS, a large number of reads are randomly sampled from the genome. Hence two random processes 
are involved in the generation of the short read data: the generation of the underlying genome sequence 
and the random sampling of the reads. 

We use an r-th order homogeneous ergodic MC to model the underlying genome sequence with each 
letter taking values in a finite alphabet set A of size L. Since our study is based on genomic sequences, 
L = 4. As in Daley and S mi th ( |2013| ); Lander and Waterman (19881; Simpson (2014 1 ; Zhai et al. ( 201 2[ ) ; 
Zhang et al .j(2008), we assume that the genome is continuous and that the distribution of reads along the 
genome follows a potentially inhomogeneous Poisson process with rate c(x ) at position x. If c(x) = c 
for all x, we refer to the sampling of the reads as homogeneous. We assume that all sampled reads have 
the same length of (3 bps. A total of M reads are independently sampled from the genome of length G 
bps. 

We extend the statistics Sp. and Z w in ([T]) and (|2| to NGS short read data accordingly. Let be 
the number of occurrences of the A;-word w in the short read data, where the superscript R refers to the 
“read” data, and define 
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We have the following theorem on the approximate distributions of Sj} and Zlf:. the proof is given in 
the Supplementary Materials. Note that for each read we discard the last k — 1 positions as they would 
lead to words of length less than k: the error made with this approximation is asymptotically negligible 
when k is small relative to (3. 


Theorem 1 Assume that the underlying genome follows a (k — 2 )-th order MC which assigns non-zero 
probability to every k-word w. Let Sj} and be defined as in ([j]) and respectively. Suppose 
that the genome of length G can be divided into (not necessarily contiguous) regions with constant 
coverage r, for the i-th region, so that every base is covered exactly r, times, based on the first /3 — k + 1 
positions of the reads. Let Gi be the length of the i-th region that changes with G in a way such that 
limG '_ s . 00 Gi/G = fi > Ofor the i-th region, i = 1, 2, • • •. Let 


d 


>1, r ffi 

Ei r ifi ' 


( 5 ) 


Then, as G H> oc, 

a) For each k-word w, in distribution, Z^/\fd —> A r (0,1). 


4 































b) The statistic Sjp/d has an approximate \ 2 -distribution with dff. = (L — 1 ) 2 L k ~ 2 degrees of free¬ 
dom; equivalently, the statistic Sp has an approximate gamma distribution with shape parameter 
(If l, /2 and scale parameter 2d. 


If the M reads are sampled homogeneously along the genome with coverage c based on the first 3 — 


k +1 positions of the reads along the genome, i.e. c = 


_ M{P-k+l) 


G-k+1 


, the Lander-Waterman formula (Lander 


and Waterman 1988) shows that the fraction of genome covered r* = i times is /,; = exp (—c)c l /il. 


Under this assumption, we obtain 


d = 


E, 


i 2 fi 


c 2 + c 




= c+ 1. 


The results in Theorem [I] continue to hold when taking d = c+1. 

In the Lander-Waterman model for physical mapping ( Lander and Waterman| 19881, the factor c = 
^pf is the coverage of the genome. Hence we refer to d from <(5]) as the effective coverage of the reads 
along the genome based on the first f3 — k + 1 positions of each read. 


2.2 Estimating the order of the MC based on NGS reads 

Based on Theorem [T] we can estimate the order r of a MC sequence using NGS reads. First, we test 
the null hypothesis that the sequence follows an independent identically distributed (i.i.d; MC order = 0) 
model. For a test at significance level a, if Spp/d is higher than the 1 — a quantile of the ^-distribution 
with df = (L — l) 2 degrees of freedom, the i.i.d hypothesis is rejected. If this null hypothesis is 
rejected, then here we propose an estimator for the order of a MC; it is an analog of a corresponding 
established estimator of MC orders based on long sequences that has been shown to be effective. In the 
Supplementary Materials we present four related estimators as well as estimators based on the AIC and 
BIC information criteria; the one presented here has the best performance in simulation studies. 

We assume that the word length k > 2 and that the assumptions of Theorem [T] are satisfied. Then, 
for k > r + 2, Sjp/d has approximately a ^-distribution with (L — \) 2 I k 2 degrees of freedom. If 
k < r + 2, then Spp/d will typically be larger than expected from this \ 2 -distribution. For k > r + 2. the 

S R 

law of large numbers gives that -jfff —y 1 for G —f oo; if /;: < r + 2 then the ratio will be much larger 
than 1 in the limit. Therefore we can estimate r as follows: 


r Sk = argmirq. 
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In general, we want the value of rnirifc 


Sj? +l 


to be very small, e.g, less than 0.01. 


Using the law of large numbers it can be shown that under our assumptions this estimator is consis¬ 
tent, in the sense that fs k tends to r in probability as G tends to infinity. 


2.3 Estimating the effective coverage d 

Often the effective coverage d is not known and we would like to estimate the effective coverage d 
using NGS short read data. From Theorem [T] we can see that, under the general conditions stated in 
the theorem, (Z^) 2 jd follows a x 2 -distribution with one degree of freedom. Since the median of the 
X 2 -distribution with one degree of freedom is about 0.456, we can use the scaled median as a robust 
estimator for d\ 

d = median{(Z^) 2 , w e yl fc }/0.456. (7) 
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When we assume that the underlying long sequence follows a MC of order at most m, we use 
(m + 2)-words to estimate d using ([7]). 

Note that for an i.i.d. model sequence, the set of 2-words would not yield meaningful results as 
there are only 16 different 2-words and the median based on 16 numbers is generally not reliable. As an 
underlying genome sequence following an r-th order MC can also be seen as an (r + l)-th, (r + 2)-th, 
..., and higher order MC sequence, we can use fc-words with relatively large k (> r + 2) to estimate the 
factor d, if the maximum order of a MC is unknown beforehand. 


2.4 Simulation study 


For the simulation study, we first generate MCs of different orders. For realistic parameter values, the 
transition probability matrices of the MCs are based on real cis-regulatory module (CRM) DNA se¬ 
quences in mouse forebrain from Blow et al. (2010). We use CRM sequences here because CRM 


sequences are often used to study the effectiveness of alignment-free sequence dissimilarity measures 
( Goke et ah] 2012 Ren et all] 201 3[ [Song et al.| 20141. To take into consideration that in real genomic 
sequences, many repeat regions are present, we insert repeats into the generated MCs. We simulate NGS 
data by sampling a varying number of reads of different lengths from the MC, varying genome length as 
well as coverage. 

We include homogeneous and inhomogeneous sampling of the reads as well as sequencing errors. 
We also let the sampling rate of the reads depend on the GC content of the fragments based on data from 
the current sequencing technologies (Benjamini and Speed| 2012). We set the sequencing error rate at 
10%, which is relatively high compared to the true sequencing error rate in real sequencing in order to 
clearly distinguish among the estimators with regards to their robustness to sequencing errors. When a 
sequencing error occurs at a position, the nucleotide base is changed to one of the other three nucleotides 
with equal probability. 

Once the NGS reads are generated, we calculate the statistics Sj} and for each word w, the order 
estimator rs k and the estimator for effective coverage d based on O; each procedure is repeated 1000 
times. In each repeat experiment, we let the order estimator choose the model from 1st, 2nd, • • •, 5th 
order MCs; we estimate the effective coverage d by ([7]), using 3-tuples for a first order MC, and 4-tuples 
for a second order MC. The details are given in the Supplementary Materials. 


2.5 Applications to the study of relationships among organisms 

We test our methods on real and simulated NGS data from 28 vertebrate species whose complete genomic 
sequences are available and that are comprehensively studied in (Karolchik et al., 2008 [ Miller et~ah| 


2007). We download the genomes of the 28 vertebrate species from UCSC Genome Browser, and then 


use MetaSim (Richter et al. 2008) to simulate reads from each of the 28 vertebrate species. In simulations 
the accuracy of the order estimation increases with read coverage. To reflect a worst-case scenario, we 
set the read coverage to be 1 as a lower bound for the performance although the current sequencing 
technology can generate data with very high read coverage. We set MetaSim to generate reads of length 
62bp under the error rate which is estimated by Illumina in our simulations. 

To estimate the order of MC based on the NGS sample for each of the 28 species, we apply the 


order estimator f\s k in (121; there is no sharp ratio transition found over k = 2, • • • ,14. Given that real 


genomes consist of multiple types of regions (coding, non-coding and regulatory regions) and each type 
may fit to different MC models, the result indicates that no suitable MC model can adequately fit all 
the patterns in the genome. Instead, we fit the data with a MC model that can explain the majority (say 
80%) of the word patterns in the genome. Motivated by the normal approximation of a particular word 
statistic in Theorem [I] we study the fraction of A;-words whose occurrences can be explained using the 
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statistic (Z^) 2 /d by comparison to a x 2 -distribution with one degree of freedom with type I error 0.01. 
We estimate the order of MC to be the smallest k — 2 under which more than 80% of /e-words can be 
explained by the (k — 2)-th order MC. 

To cluster the organisms, we use the inferred MC models to estimate the expected number of oc¬ 
currences of word patterns and then study the relationships among the organisms using our dissimilarity 
measures d\ and e?f. We briefly present their definitions below, please see Song et al. (2014 2013) for 
details. Then we apply a similar approach to study the relationships among 13 tree species with NGS 
reads, for which neither the complete genome sequences nor their relationships are known. To estimate 
the unknown effective coverage d using /c-words by (J7J, we let k to be relatively large and use d as the 
value at which the estimated d stabilizes as k increases. 


2.6 Alignment-free sequence comparison dissimilarity measures 

Consider two sets of NGS reads from two genomes. We use superscripts (1) and (2) to denote the first 
and the second read set, respectively. Suppose that reads of length /Jw are in the /-th data set. 

Since the reads can come from either the forward strand or the reverse strand of the genome in NGS, 
we supplement the observed reads by their complements and refer to the joint set of the reads and the 
complements as the read set. 

(i) (i) 

Let A’ w be the count of the word w in the v'-th data set. We define EN w to be the expected number 

of occurrences of word w based on either the i.i.d model or a Markov model, EN$ = _ 

k + l)(pw + Pw ); where MW(/?W — k + 1) is the total number of /c-word in the i -th sample, w is the 

(i) 

complement of word w, and p w is the probability of word w in the /-th genome under a specific model. 
Then we define D\ and D 2 as follows, 
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where N^, = JV* — EN W , i = 1,2. Further, the dissimilarity measures d,* 2 and d 2 , ranging from 
0 to 1, are defined as, 
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3 Results 


3.1 Summary of simulation results 

Due to page limitations, we summarize the simulation results here; details are given in the Supplementary 
Materials. Our extensive simulations show that the simulated mean, standard deviation and distributions 
of S[! and are very close to their corresponding theoretical approximations given by Theorem [I] 
Both the effective coverage and the MC order can be estimated accurately under the parameter settings 
of the current sequencing technologies. 


3.2 The relationship among 28 vertebrate species 


Table S4 shows the estimated orders of MCs for a group of 28 vertebrate species that are studied in 
( Karolchik et al. 2008 Miller et al. 2007 1 based on simulated NGS short reads. For each of the 28 
species, we compute the fraction of the fc-words that have (Z^) 2 /d within the 99% of a ^-distribution 
with one degree of freedom, for k = 8,9,..., 14. Using 80% as a threshold, we estimate the order of 
MC for each species to be the smallest k — 2 under which the fraction of words that can be explained by 
the (k — 2)-th order MC is greater than the threshold. 

Comparing our results with the results in Narlikar et al. ( 2013), where the order of MCs for a selection 
of vertebrate genomes was estimated by AIC and BIC criteria using whole genome sequences, we find 
that the estimated order based on NGS read data are almost the same as that estimated based on the whole 


genome sequences in Narlikar et al. ( 2013 ). Our proposed methods of estimating the order of MC based 
on short reads of NGS data achieve the same accuracy as that based on whole genome sequences. 

For a given value of k, we compute d 2 and d 2 using an r-th order MC, r = 0 (i.i.d model),..., (k— 2) 
for each pair of species, yielding a 28 x 28 pairwise dissimilarity matrix under each MC model. To 
evaluate the dissimilarity measures, we use the pairwise distance matrix obtained from Figure SI in 
Miller et al. (|2007 1 as the gold standard for the dissimilarity between each pair of the 28 species; the 


matrix is given as Table S5 in the Supplementary Materials. Note that the estimated orders of the 28 
species range from 7 to 11, and the average order is 10. To study the performance of the dissimilarity 
measures under different orders of MC, we choose k = 14 such that we can study the results under the 
MC model with orders up to 12. 

Table [T] shows Spearman’s rank correlation coefficient (SPCC) between the standard distance and 
the dissimilarity estimated by the cfe-type measures under MC models of various orders; higher SPCC 
indicates better performance. Both measures, d 2 and (if, achieve their best results of SPCC=0.92 when 
using a MC of order 12. Note that using a simplistic dissimilarity measure d 2 only gives SPCC=0.08. 

In general both d 2 and (if obtain higher SPCC with the standard matrix as the order of MC increases, 
except for (if at order 9. In particular, the measure d 2 has negative correlation coefficient with the 
standard distance under the i.i.d model. The SPCC becomes stable when the order of the MC used for 
the analysis is close to 11, the maximum estimated MC orders over the 28 species. Here (if is less 
affected by the order of the MC than d 2 . When the appropriate order of MC is used, d 2 and (if perform 
similarly and much better than d 2 . 


rU-typc 

order=0 

order=5 

order=9 

order=10 

order=l 1 

order=12 

d* 2 

-0.21 

-0.16 

0.85 

0.89 

0.90 

0.92 

4 

0.86 

0.87 

0.85 

0.88 

0.90 

0.92 


Table 1: The Spearman's rank correlation coefficient (SPCC) between the true distance matrix and the 
dissimilarity matrix by r^-type dissimilarity measures under MC models with order 0 (i.i.d), 5, 9, 10, 11 
and 12. The length of the /c-tuple word is 14. 
























3.3 The relationship among 13 tropical tree species with unknown reference genomes 


We also apply our method to the 13 tree species based on the NGS shotgun read data sets in Cannon 


et al. (2010). The reference genome sequences for the 13 tree species are unknown. Our objective is to 


cluster these tree species using d\ and df with MCs for the sequences. 

The estimated order of the MC for all the 13 tree species is 8. We use the dissimilarity measures d\ 
and df under various orders of MC as the background model to cluster the 13 tree species from their 
NGS reads. We choose A; = 11 so that we explore the MC with order up to 9. We use the Unweighted 
Pair Group Method with Arithmetic Mean (UPGMA) to cluster the tree species. 

The 13 trees species can be generally classified into two groups: 5 tree species from Moraceae 
and 8 tree species from Fagaceae. The two Moraceaes, Ficus altissima and Ficus microcarpa, should 
cluster together because they are known to be closely related and are both large hemiepiphytic trees 
while the other three Moraceae species are small dioecious shrubs. Within the Fagaceae group, the 
two Castanopsis species should cluster together, and the five Litliocarpus species should also form a 
subgroup. Trigonobalanus doichangensis (Fagaceae) is an ancestral genus that is very divergent from 
the rest of the family and has undergone considerable sequence evolution. It should not group within the 
class of Castanopsis and Litliocarpus in Fagaceae. 

Figure [T] shows the clustering results of the 13 tree species using df 2 under MCs of order 0 (i.i.d), 
4, 8 and 9. The frees are built based on all the reads. From the results we can see, under the i.i.d 
model, Litliocarpus mixes up with Castanopsis', T. doichangensis can not be separated from the rest of 
Fagaceae, while under the MC of order greater than 4, T. doichangensis is successfully separated from 
the rest of the Fagaceae. Moreover, within the Moraceae group. Ficus fistulas and Ficus langkokensis 
form a subgroup under the i.i.d model, and they are separated under the MC with order greater than 4. 
While F. langkokensis is the closest Maraceae to the Fagaceae under 4th order MC, F.fistulosa becomes 
the closest species to the Fagaceaes under 8th and 9th order MCs. 

In order to see whether the clustering of the tree species can be correctly inferred using only a portion 
of the shotgun read data, we randomly sample 10% of the total read data for each tree species to cluster 
them. To study the variation of the clusters due to random sampling of the reads, we repeat the sampling 
process 30 times and calculate the frequencies of each internal branch of the clustering using all the 
reads occurring among the 30 clusterings. The number on the branch refers to the frequency of the 
branch occurring among the 30 clusterings based on random sampled 10% reads. Three branches of the 
tree under MC of order 9 have frequencies of occurrence less than 30. When using the MC of a very 
high order, the clustering becomes unstable. 

For the clustering results using df > see Figure S7. Under MC with all four orders, the two Cas¬ 
tanopsis and the five Litliocarpus species are grouped separately, and F. altissima (Moraceae) and 
F.microcarpa (Moraceae) are clustered together. Under the i.i.d model, T.doichangenesis (Fagaceae) 
is successfully separated from Litliocarpus, but it is not the most outside species in the Fagaceae group. 
When the MC order is greater than 4, T.doichangenesis (Fagaceae) gets separated from the rest of the 
Fagaceaes. It can also be seen that when using the i.i.d model, or a MC with order 8 or greater, some of 
the branches becomes unstable. 

In general, the results show that the clustering becomes more accurate as the order of MC increases 
using both d 2 and df. Under the i.i.d model, the clustering based on d.j does not correctly separate 
Castanopsis from Litliocarpus, while the clustering based on df groups the two types separately. With 
higher order MCs, d 2 successfully separates Castanopsis from Litliocarpus. The general clustering struc¬ 
ture among Litliocarpus, Castanopsis, Trigonobalanus and Ficus stays correct when order is greater than 
4 for both measures. When using the MC with order higher than the estimated order, the clustering is 
unstable and indeed the branch for L.Hancei (Fagaceae) is not supported on the last tree when using 
only 10% of the data. With a large number of parameters to estimate, 10% of the data does not suffice to 
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capture the information in the data. The best clustering is achieved under a MC of order 8 and 9. 


4 Discussion 


Next generation sequencing technologies provide large amount of data in the form of short reads. As¬ 
sembly of the millions of short reads to recover the long sequence is challenging, because the relative 
short length of the reads makes it difficult to resolve the repeat regions, not all regions may be cov¬ 
ered, and assembly is time consuming. While multiple sequence alignment may be prohibitive, we can 
use word-count based dissimilarity measures to cluster the underlying species. These measures require 
an underlying probability model for the sequences; Markov chains arc a reasonable model for such se¬ 
quences. While transition probabilities can be estimated directly from count data, estimating the order 
of a MC here is not straightforward. 

Methods for estimating the order of a MC of a long sequence have been developed since the 1950s, 
but estimating the order of a MC directly from a set of short reads without assembly has not been studied 
yet. In this paper, we develop two statistics S'j? and and show that both Sj} and have surprisingly 
simple approximate distributions with only two parameters, one of them depending on the order of 
the original long MC sequence, and the other one depending on the distribution of the reads along the 
sequence. Intriguingly, one of these parameters is d = c + 1 under homogeneous sampling, where c is 
the coverage of the reads along the genome based on the first f3 — k + 1 positions of each read. 

Based on the property of Sjf and Z^, we develop an estimator for the order of a MC as well as an 
estimator for the parameter d based on NGS data. Extensive simulation studies are canned out to verify 
the theorem and evaluate the estimator. 

Finally, we apply the estimation methods to two NGS data sets. Since the real genome sequences 
consist of coding, non-coding and various regulatory regions, single standard MC models do not fit the 
data well. Moreover, some enriched patterns, such as the motif sequences, are widespread throughout 
the genomes and violate the simple MC model for the whole genome sequence. Hence studying the 
fraction of k-words whose occurrences can be explained using the statistic (Z^) 2 /d by comparison to a 
x\ distribution is a more realistic way to determine the order of the MC for a real genome sequence. The 
estimated orders are consistent with the orders estimated directly from the full genome sequences using 
BIC methods. 

Our primary motivation for this study is alignment-free genome comparison using NGS data. Further, 
we cluster the 28 species based on the NGS data using MC models with various orders. The results show 
that the clustering performs best and gives stable results when using a MC model with order on and 
above the estimated order. In addition, we apply the same analysis to 13 tropical tree species whose 
reference genomes are unknown; again the best clustering is achieved under a MC with the order within 
the estimated range. 

When the sequence length is short or the sequencing depth is low, the numbers of occurrences of 
some /r-words become small or even zero. Then the assumption of non-zero variance for all word counts 
which underlies the gamma approximation for Sjf no longer holds and the gamma approximation may 
not work well. In such a situation an exact test for the order of MCs in the spirit of Besag and Mondal 
(2013) could be very helpful. In this paper we have only made a start on the Markov chain modelling 
of NGS data. An exhaustive study of errors in the data, in the form of power studies, could help to 
further understand the application range of our results. Finally, in this work we take the estimation of the 
transition probabilities for granted, once the order of the MC is determined. While the estimation of the 


transition probabilities of the MC model of a long sequence has been studied by Anderson and Goodman] 
(1957) and Baum and Petrie (19661, it would be interesting to extend these methods to NGS data. 
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Figure 1: The clustering of the 13 tropical tree species using d\ under MC with order 0 (i.i.d), 4, 8 and 
9. The number on the branch refers to the frequency of the branch occurring among the 30 clusterings 
based on random sampled 10% reads. The letter ‘F’ at the beginning of the names represents Fagaceae', 
similarly the letter ‘M’ represents Maraceae. 
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Supplementary Materials 
A Proof of Theorem 1 


Here we assume that the conditions of Theorem 1 prevail. In order to prove Theorem 1, we introduce 
some notations. Let N w (i) denote the number of occurrences of w in the i-th region, i = 1,2, • • •; 
E w (i) = ^ be the estimated expected number of occurrences of w in the i-th region under 

the (k — 2)-th order MC model; and P w be the probability of w assuming that the MC starts from the 
stationary distribution. Similarly, let N w , E w , and ay, be the observed, expected, and variance of the 
number of occurrences of w along the long genome sequence. The same notations with superscript “R” 
indicate the corresponding quantities based on the short read data. We assume that k is small compared 
to p, and hence the edge effects are small. Then we have the following lemma. 


Lemma 1 


a) For any k-word w,for any region i = 1,... ,B, in distribution, 


lim 

G —^oo 


N*(i)-E*(i) 


a. 


R 


= N 0, 


r ffi 


£.j rjfj 


b) For any k-word w, in probability, 


lim 

G—yoo 



= o. 


( 8 ) 


(9) 


Proof of Lemma[I| To prove part a), note that under the null model that the MC is (k — 2)-th order, 
Theorem 6.4.2 in Reinert et al. (20051 gives that, as sequence length goes to infinity, in distribution, 

N w (i) - E w (i) 


cr v 


(0 


N( 0,1). 


Here we use that the Markov chain assigns non-zero probability to every A;- word w and hence the vari¬ 
ance <t w would be non-zero. As N^(j) = ryN w (j) the corresponding result for N^(j) follows; it 
remains to identify the asymptotic variance. We have for any region i 


lim 
G—yoo 


N w (i) 

Gi 


= /’« 


( 10 ) 


which does not depend on i as the MC is ergodic, and then we have approximately, 


E w (i) = Gi P ™ Pw and 


Thus 


E% = ^2 riE w 


(0 = 


. ^ p p 

\ w w 
' 7 ^ T V 


i 


P- 


* = 1 ) 2 ,. 


lim 

G —^oo 



S j r ifi 


lim 

G —^oo 


lim 

G—y oo 


1 -^ w (i)/iV„V(z) 

1 — n r /n r 

~w f _ w — 

1-N r /n r 

W — 1 ~w~ 


= 1 ; 


= l. 
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From the above three equations, we have 


Km ^| (i) 
G—yoo <T, 




R 




Therefore 


- E R (i) ( riN w (i) - riE w (i) \ a R (i) 


a , 


R 


N 0 , 


^w(*) 

rffi 

>E r ,fj 


a. 


R 


For Part b) of this lemma, note that 


B 


0<iV«-^iV^)<2Mfc 

3 =1 

as the only differences in the counts occur due to not counting occurrences at the boundaries of the 
regions and there are M reads. As A’ w ~ / J W G and as linio'^oc —> 0, the second assertion follows. 

□ 


From Lemmajl] we can easily show the first assertion in Theorem 1 by noting that 

r 7 R_N^~ E R _ ^ N R (i) - E R (i) E* E R {i) - E* 

~R R-R + 


a. 


a. 


a. 


R 


The last summand tends to 0 by Q. When G t is large then the i-tli term is close to a normal distribution 

r 2f. 

with mean 0 and variance 3 , . Since the dependence between the segments is weak, we can treat the 

r iJi 

terms in the first summand as independent. Part a) of Theorem 1 is proved. 

Now we prove the part b) of Theorem 1. Suppose there are only two regions with coverage n and 
region length Gi, i = 1,2. With ([9]), S R has approximately the same distribution as 


= E 


(j^w — ^w(l) — E^,{2)Y 

E R 


EE 

W \ i 


JV5(0 - E«(i) 


JV w (i)-E»(i) rj£*(i) 


EE* 


VE w (i) V 

N-wji) ~ E w (i) 

\fEJj) 


E R 
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where 


Wi = 


lriE R (i) 


rffi 


E R 


) A ''-if.. 


-,* = 1 , 2 . 
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Generally suppose that the reads come from B regions with each region having the same coverage. 
Let r,; be the coverage and G r be the genomic length of the i-th region. Using the same idea as above, 
we can approximate SJf by 


B 


sf* = E E w < 


N w (i) - E w (i) 


w \i =1 \ZE-w (*) 

Note that from Section 6.6.1 in Reinert et al. ( [2005[ ), the vector 

JVw(t) - E w (i)\ 


( 11 ) 


N(i) := 


\J E w (i) 


N (0, S LfcxL fc) , 


veA k 


in distribution, where T, 2 LkxLk is the covariance matrix with rank dfk■ Hence, in distribution, 


E 

w£A k 


N^ji) - E w (i) 
\/E w (i) 


X ldf k )i 


where X(df k ) ' s a X 2 random variable with dfk = (L — 1 ) 2 L k 2 degrees of freedom. On the other hand, 
we can find a L k x dfk matrix M such that E 2 k x f k = MM T . Let M _1 be the pseudo-inverse of M, 
then we have 

Ziii) 




Z dfS), 


= Z(i) 


such that approximately Z(i) ~ N( 0, Idf k )- As N(i) = MZ(i ) we obtain that 

N(i) T • N(i) = Z(i) T M T MZ(i). 

Since the left hand side has approximately a x 2 distribution with dfk degrees of freedom, the right hand 
side should be in distribution close to Ziif'Zii). Thus approximately, M T M = Idf k xdf k ■ Also note that 
M does not depends on i because the correlation structure of N(i) is the same across the regions under 
the null model. Then (111 for SY’* can be written as 


B 


qR ,* _ 


B \ T / B 

J2WiZ{i)] M T MiJ2WiZ(i) 


\ i=l 
' B 


\i =1 


B 


dfk / B 

k =1 \i=l 


, i =1 
, 2 


Since Z k (i) are all approximately i.i.d N( 0,1) random variables and , W 2 ~ ^ )=] ^ 


B w 2 ^ B r ? fi 
3 r iE 


we 


obtain that, in distribution, 

Y k = J2 WiZ k {i) -> N (o, J2 W?) and 


Y k 


i= 1 


i= 1 


y B , w 2 

Z-^l=l i 


N( 0,1). 
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VEf=i^/J 


x 2 te/*)-° 


B Methods 


B.l Estimating the order of a Markov chain based on Theorem 1 


We assume that k > 2 and that the assumptions of Theorem 1 are satisfied. Moreover we assume for now 
that d is known; in practice, the effective coverage d is replaced by the estimated value d. In addition to 
our estimator 


f Sk = argmin fc 


qR 

1 


- 1 


( 12 ) 


we define four related estimators based on Theorem 1; they are all analogs of corresponding established 
estimators for the order of a Markov chain. 


1. Instead of using S R directly, we can calculate the p-value 

Pk = P ($? > sfj P (s£/d > sf/d) = P (x% k > 8%/d) , 

where s R is the observed value of S k ’ based on the short read data. We expect pf : to be small for 
k < r + 2 while pi- will not be that small for k > r + 2. Therefore, we expect log(pfc+i)/ log(pfc) 
to be the smallest when k = r + 1. Thus we can also estimate the order of a MC by 


f Pk = argmin fc 


log(Pfc+i) 

log (Pk) 


- 1. 


(13) 


2. For a given significance level a, we check consecutively if pk+i < a and stop when pk+i > a. 
We estimate r by 

fh = argmin fc {pfc + i > a} — 1, for a given significant level a. (14) 


To avoid early stopping, we can also require that both pi- and pk+i are larger than a. 

3. If k > r + 2, then Z^/Vd, is approximately standard normal iV(0,1) and thus (Z^) 2 /d has 
approximately a ^-distribution with one degree of freedom. If /;: < r + 2, for some k -word w, 
(Z^) 2 /d is generally larger than a ^-distributed random variable with one degree of freedom. 
Therefore we would expect Z^ ax (k) to be large when k < r + 2 and Z^ ]:ix (k) to be relatively 
small for k > r + 2. As Z^ ax is the maximum value over L k variables, we should divide Z^ ay .(k) 
by L k . Therefore, we estimate the order of the MC r by 

—1, (15) 


where Z^Jk) = max w , H=fc \Z*\. 


4. Extending the method by Morvai and Weiss (2005) and Peres and Shields (2005), for a set of short 
reads and a (k — l)-word v = v\ ■ ■ ■ v^-i, define 


A fc_1 (v) 


max 

aeA 


N, 


R 


n r n r 

v a v . 

n r ’ 
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then 

A k = max (A fc_1 (v)} = max 
v£.A fc—1 wG^4 fc 

is the maximum difference between the number of occurrences of a k- word and its estimated 
expectation under the (fc — 2)-th order MC. Our Peres-Shields-type estimator is 


N R N r 

AT-TL _ W W 


N R 


rps(x) = argmax fc 


A k 
A k + 1 


- 1. 


(16) 


B.2 Estimating the order of the MC based on modified AIC and BIC 


Several methods based on the Akaike information criterion (AIC) and the Bayesian information criterion 


(BIC) have been proposed to estimate the order of MC based on long sequences (Hurvich and Tsai 1989 


Katz||198T| Narlikar et al.| 2013; Tong 1975 Zhao et al. 20011. The AIC and BIC for long sequences 


are defined by 


AIC(k) = — 21og(Sequence-Likelihood, Mk) + 2|Affc|, (17) 

AICc{k) = AIC(k) + 2 |^ fcl(l ^ fcl + 1) ,and (18) 

BIC(k ) = — 21og(Sequence-Likelihood, Mk) + \Mk\ log \S\k, (19) 

where At/,, indicates the k-th order Markov chain, A!/,, is the number of parameters for the fc-th order 
Markov model, i.e. |Af k\ = {L — 1 )L k , and \S\k is the size of the data. For a long sequence of length 
G, we have that \S\k = G — k is the number of (k+1) tuples. 

However, no formulas have been defined for AIC and BIC of Markov models based on NGS data. 
In the following we modify the definitions of AIC and BIC given in ( [17] ), ( fi~8| ) and ( [T9| ) respectively so 
that they are applicable to NGS data. For a NGS short read sample, we define the log-pseudo-likelihood 
under the k-th order MC model as, 

N r 

log( p LH R ) = J2 ^wlog^, 

wg^fc+i w 


by replacing N w with N R in the log-likelihood of a long sequence. We think of the factor d as the 
effective coverage of the reads along the genome. So the pseudo-likelihood of the NGS data under 
the k-th order of MC model is approximately the likelihood of the effectivele covered region along the 
genome to the power of d, i.e. p LH R ~ (likelihood of the covered genomic region)^. Thus, the log- 
likelihood of the covered genomic region is approximately \og( p LH R )/d. Based on the idea of AIC for 
long sequences, we minimize —2 log ( p LH R )/d + 2(L — 1 )L k with respect to k. Therefore, we propose 
the AIC for k=l, 2, ..., based on NGS data as 

AIC R {k) = —2 \og{ p LH R ) + 2 d(L - 1 )L k . 


To define BIC and AICc for NGS data, we also need to find a suitable analog of the data size <5|/ ; . 
in ([18]) and (jT9]). A naive substitute is the total effective read length M(/3 — A:), or in other words, the 
number of (k + l)-tuples used to estimate the likelihood under the A:-th order MC model. However, the 
reads can overlap and adjustments are needed. Using the effective coverage factor d as a normalization 
factor from the NGS to long sequence, the length of the effective covered genomic region is ATLA) 
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Then we define AICc and BIC for the NGS read data as 


p . p 2(L — l)L fc ((L - l)L k + l) 

AICc R (k = AIC R (k + } , \ ---f ,and 

_ (L _ 1)L fc _ 2 ) 

M*(0- k ) 


BIC R {k) = —2 \og( p LH R ) + d(L - l)L fc log 


R 


We define the estimators of the order of a Markov model based on AIC, AICc and BIC for the NGS read 
data by 


r A ic R = argmin k AIC R (k) 

(20) 

r aicc r = argmin k AICc R (k) 

(21) 

r bic r = argmin k BIC R {k) 

(22) 


In order to see the effect of the normalization by the effective coverage factor d, we denote by taic* 
taicc and f'B jc the naive estimators, which are obtained by simply substituting the log-sequence- 
likelihood and the size of the data by the log-pseudo-likelihood and the number of (k-i-l)-tuples in the 
NGS read data without normalization by d. 


C Results 


C.l Simulation studies for the validation of the theoretical results 

For the simulation studies we first generate MCs of different orders. For realistic parameter values, 
the transition probability matrices of the MCs are based on real cis-regulatory module (CRM) DNA 


sequences in mouse forebrain from Blow et al. (2010). We start the sequence from the stationary distri¬ 
bution. In addition to the first order MC model described above, we also simulate a second order MC. 
Table [ST] shows the transition probability of a) the first and b) the second order MC matrices we used in 
the simulations. 

We generate MCs with length of G = 10 5 and 2 x 10 5 . We simulate NGS data by sampling a varying 
number of reads of different lengths from the MC as follows. We use the Lander-Waterman model for 
physical mapping (Lander and Waterman] 19881 to sample NGS reads homogeneously with read length 
/3 = 100, 200,300,400, 500 and the number of reads M = 500,1000. The coverage c is calculated as 
c = M(3/G and the effective coverage based on the first (3 — k + 1 positions is d = c e q- + 1, where 
c eff = M(/? — k + 1)/(G — k + 1). Under each combination of (G, M, f3), we calculate the values of 
Z R for each word w and S R based on the NGS read data. Then we repeat the processes 2000 times to 
obtain the empirical distribution of Z R and S R . Finally, we compare the mean and variance of S R with 
their corresponding theoretical approximations and test the fit of the data to the theoretical approximate 
distribution using the Kolmogorov-Smirnov (KS) test. 

With the current NGS technologies, the reads are generally not homogeneously sampled from the 
genome. In order to see the effects of inhomogeneous sampling on the approximate distributions of S R 
and Z R , similarly as in Song et al. (2013) we implement the following simulation. We divide the long 


sequences into 100 blocks, bi ,..., bt ,..., h\na- For each block bt, the sampling probability A i for each 
position in this block is proportional to a random number which is drawn independently from the gamma 
distribution T(l, 20) (one number per block). 

It has been observed that the probability that a fragment is sequenced in NGS depends on its nu¬ 
cleotide content. Empirical studies showed that the dependency on GC content is unimodal and we use 


the empirical unimodal distribution on GC curve shown in the software developed by Benjamini and 
Speed (|2012 1 as an example to generate the reads. The other parameters are the same as before. 
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(a) The transition probability matrix of the first order MC 



A 

C 

G 

T 

A 

0.39 

0.15 

0.21 

0.25 

C 

0.31 

0.19 

0.12 

0.38 

G 

0.31 

0.23 

0.20 

0.26 

T 

0.26 

0.18 

0.21 

0.35 


(b) The transition probability matrix of the second order MC 



A 

c 

G 

T 

AA 

0.39 

0.19 

0.19 

0.24 

AC 

0.25 

0.26 

0.21 

0.27 

AG 

0.38 

0.17 

0.29 

0.15 

AT 

0.30 

0.24 

0.22 

0.24 

CA 

0.46 

0.20 

0.15 

0.18 

CC 

0.26 

0.40 

0.11 

0.22 

CG 

0.23 

0.16 

0.14 

0.48 

CT 

0.29 

0.20 

0.15 

0.37 

GA 

0.26 

0.25 

0.23 

0.26 

GC 

0.25 

0.17 

0.33 

0.25 

GG 

0.26 

0.30 

0.28 

0.17 

GT 

0.29 

0.12 

0.22 

0.37 

TA 

0.19 

0.19 

0.21 

0.41 

TC 

0.12 

0.29 

0.24 

0.35 

TG 

0.27 

0.17 

0.25 

0.32 

TT 

0.11 

0.18 

0.24 

0.47 


Table S1: The transition probability matrices of a) the first and b) the second order Markov chain in our 
simulation studies. 
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Sequencing error is another factor that reduces the data quality. Currently, Illumina sequencing has 
an error rate about 0.1% and 454 sequencing has an error rate about 1% ( Glenn[ 2011). In order to see 
the effect of sequencing errors on the distribution of S[' ! and Z^, in each position of a read, we randomly 
replace the letter in that position with one of the other three letters with equal probability 0.005; the 
different letter is drawn with equal probability. The remaining simulation steps stay the same as before. 

Next we evaluate the proposed estimators of the order of MCs, fs k , r Pk , fh, fz k and fps and the 
estimator for the effective coverage d based on equation (7) in the main text, respectively. For estimating 
the effective coverage d by equation (7) in the main text, we using 3-tuples for a first order MC, and 
4-tuples for a second order MC. For evaluating the proposed estimators, we let A: range from 2 to 6, i.e. 
we choose the model from 1st, 2nd, • • •, 5th order MC. The performance of the estimator is measure by 
the precision rate, i.e. the percentage of times in 1000 repeats the estimator gives the true order. In order 
to see the effect of genome length and sequencing depth on the estimators, we take the genome length 
5000, 7500, 10000, 20000, 50000, while fixing the read coverage to be 1; we take the read coverage 
0.05, 0.1, 0.25, 0.5, 0.75, 1, while fixing the genome length to be 10 5 . We also set the sequencing error 
rate at 10%, which is relatively high compared to the true sequencing error rate in real sequencing, as to 
distinguish clearly between the estimators with respect to their robustness to sequencing errors. For the 
estimator fh, we set a = 0.05. 

When the sequence length is short or the coverage is low, it is possible that the numbers of occur¬ 
rences of some A:-words are close to zero and the expected numbers of occurrences are very low; the 
estimated expected number of occurrences may turn out to be 0 for some A-words. In order to overcome 
the issue, we add a pseudo count of 1 to the number of occurrence of all words, which is a common 
procedure to avoid division by zero and is equivalent to incorporating a flat prior during the parameter 
estimation in terms of Bayesian statistics, see Narlikar et al. (20131; Strelioff et al. ( [2007 ). 


C.1.1 Simulation results for Theorem 1: Homogeneous sampling of the reads 

We use simulations to validate the theoretical results in Theorem 1. In our simulation study, here we first 
generate sequences with genome length of G = 1 X 10 5 and 2 x 10 5 bps under the first order MC model 
as shown in Table S 1(a). S% with their - corresponding theoretical approximations in Theorem 1 are given 
in Table S2. It can be seen from the table that the approximate mean and variance of S 3 are very close 
to their simulated values except in the case of a large number (M = 1000) of very short reads (j3 = 100) 
in a small genome (G = 10 5 bps). 

Moreover, Figure [sT[ a) shows typical Q-Q (Quantile-Quantile) plots of S 3 /d versus the distribution 
of \ 3 6 , where the subscript 36 indicates the degree of freedom of the % 2 distribution, under different 
models of sampling reads with/without sequencing errors. Similarly, Figure |S2[a) show the Q-Q plots 
of Z^cT/'Jd versus the standard normal distribution under different scenarios; here, the word under 
consideration is w = ACT. The theoretical approximations work well in this situation. 

Simulation studies under the higher order MC model and on genomes inserted with repeated regions 
are carried out in a similar fashion (data not shown). The same conclusion that the theoretical approxi¬ 
mate distributions of Sj} and Z^ fit their simulated distributions well holds. 


C.1.2 Simulation results for Theorem 1: The effect of inhomogeneous sampling and of sequencing 
error 

We use Q-Q plots to show the effects of inhomogeneous sampling and sequencing errors on the distri¬ 
bution of Sf} and Z^ for (G,f3,M) = (10 5 , 200, 1000). Figure SI (a, b, c, d) shows the Q-Q plots 
of the 2000 S 3 /d scores from (a) homogeneous sampling, (b) inhomogeneous sampling with rate not 
depending on GC content, (c) inhomogeneous sampling with rate depending on GC content, and (d) ho¬ 
mogeneous sampling with sequencing errors v.s. 2000 scores sampled from distribution. The factor 
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Figure SI: Q-Q plots of the 2000 S^/d scores v.s. 2000 scores sampled from a x| 6 distribution; 
( G,f3,M ) = (10 5 , 200, 1000). a): homogeneous sampling without error, b): inhomogeneous sam¬ 
pling without error, c): inhomogeneous sampling with sampling rate depending on GC content, d): 
homogeneous sampling with error. 


24 









Theoretical Quantiles 

(a) homogeneous, without error 


Theoretical Quantiles 

(b) inhomogeneous, without error 



Theoretical Quantiles Theoretical Quantiles 

(c) inhomogeneous on GC content (d) homogeneous, with error 


Figure S2: Q-Q plots of the 2000 Z^ CT /\fd scores of the word ACT v.s. the 2000 scores sampled 
from the standard normal distribution; ( G,f3,M ) = (10 5 , 200, 1000). a): homogeneous sampling 
without error, b): inhomogeneous sampling without error, c): inhomogeneous sampling with sampling 
rate depending on GC content, d): homogeneous sampling with error. 
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(a) Genome length G = lx 10 5 



M = 500 

M = 1000 

p 

mean 

mean 

var 

var 

p-value 

mean 

mean 

var 

var 

p-value 

100 

53.4 

54.0 

160.0 

162.0 

0.06 

70.8 

72.0 

280.0 

288.0 

0.002 

200 

71.6 

72.0 

211A 

288.0 

0.99 

106.9 

108.0 

660.5 

648.0 

0.15 

300 

89.8 

90.0 

482.9 

450.0 

0.39 

143.9 

144.0 

1139.5 

1152.0 

0.15 

400 

107.6 

108.0 

655.0 

648.0 

0.84 

179.8 

180.0 

1901.3 

1800.0 

0.20 

500 

126.0 

126.0 

905.2 

882.0 

0.86 

216.6 

216.0 

2675.3 

2592.0 

0.26 


(b) Genome length G = 2 x 10 5 



M = 500 

M = 1000 

P 

mean 

mean 

var 

var 

p-value 

mean 

mean 

var 

var 

p-value 

100 

44.6 

45.0 

110.9 

112.5 

0.12 

53.7 

54.0 

142.7 

162.0 

0.61 

200 

53.5 

54.0 

159.6 

162.0 

0.13 

71.8 

72.0 

278.4 

288.0 

0.64 

300 

62.7 

63.0 

222.5 

220.5 

0.95 

89.0 

90.0 

440.1 

450.0 

0.41 

400 

71.7 

72.0 

286.0 

288.0 

0.51 

107.4 

108.0 

645.7 

648.0 

0.26 

500 

80.9 

81.0 

357.5 

364.5 

0.69 

124.7 

126.0 

821.3 

882.0 

0.99 


Table S2: Comparison of mean and variance of with their corresponding theoretical approximations 
under a first order MC model and the fit of the data to the theoretical approximate distribution using 
the KS test. The simulation process was repeated 2000 times for each combination of (G, M, fd). The 
columns mean and var are the simulated mean and variance; the columns mean and var are the theoretical 
mean and variance. 


d is c e ff + 1 in homogeneous sampling; and d is calculated from the exact distribution of the sampled 
reads along the sequence in the inhomogeneous sampling situation. Figure S2 gives the Q-Q plots for 
Zact /showing the effect of inhomogeneous sampling and sequencing error on the distribution of 
Zact / 'd r d- All Q-Q plots show a satisfactory fit, confirming that the theoretical results from Theorem 1 
hold even when the assumptions are not necessarily satisfied. 

C.1.3 Simulation results on estimating the order of MCs based on simulated NGS reads 

Figure[S3]shows the effects of sequence length and read coverage on the precision of the estimators under 
a first and second order MCs, with homogeneous sampling and sequencing error rate 10%. It can be seen 
that all the five estimators perform reasonably well. In particular, the precision rate of estimators fg k , 
f ph , fz k and rps reach 100% when the genome length is larger than 20000 bps and the read coverage 
is greater than 0.2. The estimator fy, performs slightly worse than the other four estimators. Since the 
estimator fy is based on hypothesis testing with a given significant level a, the precision rate is not able 
to reach 100%. It is also possible that no k from 2 to 6 satisfies pk-i < ol and both yy, and pk+i are 
larger than a such that the estimator fy fails to give an estimation. We observe that the precision of fy, 
is sensitive to the accuracy of the estimation of the effective coverage d. In the simulation, if we take the 
underlying true value of d in place of the estimated value of d in the computation, the precision rate of 
fh goes up to above 90%. 

For inhomogeneous sampling, Figure [S4| shows the effects of sequence length and read coverage on 
the precision of the estimators. We can see that the precision rates under the inhomogeneous sampling 
start at slightly lower values than those under the homogeneous case. As the genome length and the read 
coverage increase, the estimators perform as well as they perform under the homogeneous sampling. 

Figure [S5] and [S6] show the disappointing precision rates of the AIC and BIC based estimators of the 
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Genome Length 

(a) a first order MC, c = 1, homogeneous 


Coverage 

(b) a first order MC, G = 10°, homogeneous 



Genome Length 

(c) a second order MC, c = 1, homogeneous 


Coverage 

(d) a second order MC, G = 10 5 , homogeneous 


Figure S3: The precision rates of the estimators for the order of a MC under a first and a second order 
MC, with homogeneous sampling and sequencing error rate 10%. (a,b): The effects of genome length 
and read coverage to the precision rates under a first order MC. (c,d): The effects of genome length and 
read coverage on the precision rates under a second order MC. 
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Genome Length 

(a) a first order MC, c = 1, inhomogeneous 



i i i i r 

0.2 0.4 0.6 0.8 1.0 

Coverage 

(b) a first order MC, G = 10 s , inhomogeneous 



Genome Length Coverage 

(c) a secomd order MC, c = 1, inhomogeneous (d) a second order MC, G = 10 s , inhomogeneous 

Figure S4: The precision rates of the estimators for the order of a MC under a first and a second order 
MC, with inhomogeneous sampling and sequencing error rate 10%. (a,b): The effects of genome length 
and read coverage to the precision rates under a first order MC. (c,d): The effects of genome length and 
read coverage on the precision rates under a second order MC. 
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order of a MC under a first and a second order MC, with homogeneous (Figure [S5]> and inhomogeneous 
sampling (Figure [S6| ). The sequencing error rate is 10% and read length is 100. It can be seen that 
the normalized estimators r AIC r, t a iCc R an d ' tire 11 (i n solid lines) have better performance than the 
their corresponding naive estimators tajc, taiCc and rpic (i n dotted lines). The BIC based estimator, 
f bic r ’ has generally the best performance among all the AIC and BIC based statistics. However, the 
precision of t bic r is low compared to the five estimators rs k , r Pk , fh, f'z k , and rps- With the increase 
of the genome length and read coverage, although r BIC R is able to reach a very high precision rate at 
some point, the precision finally drops with further increase of the genome length and the read coverage, 
and it fails to give a consistent estimation. The AIC and AICc based estimators, taic and f, 4 /cv., do not 
differ much in the precision rate. 


C.1.4 Simulation results on estimating the effective coverage d 


For inhomogeneous sampling of the reads, it is not clear how we estimate the parameter d in Theorem 
1 based purely on the naive read coverage c. In equation (7) in the main text, we propose a method to 
estimate d based on the statistics (Z^ j 2 . w £ A 1 ' . We assess its accuracy using the average relative 


T \dj-d | 


where d, is the estimated value of d in the i-th simulation out 


error, defined by MRE = (U 
of the total T repeats, and by the root-mean-relative-squared-error (RMRSR) defined as RMRSR = 

!\/e fiTAovr. 

The values of d for the four combinations of (/3, M) and the average of their estimations are given in 


the first and second rows of Table S3 The MRE and RMRSR for the different combinations of (/?, M) 
are given in the third and fourth rows of Table [S3] Table [S3] shows the results for (a) the homogeneous 
and (b) the inhomogeneous sampling of the reads using k = 3. In general, although the estimation of 
the effective coverage d becomes slightly inaccurate as the read coverage increases, the estimation is 
reasonably good. Under the inhomogeneous sampling, the relative errors are slightly higher than those 
under the homogeneous case, which reflects the greater randomness the inhomogeneous sampling brings 
into the model; while the MRE error seems unaffected by the choice of d, the RMRSR shows a tendency 
to increase with increasing d. 

We only consider (/?, M) = (100, 2000) for inhomogeneous sampling with sampling rate depending 
on GC content as before. In this case, the value of d is 2.08. The mean value of estimated d is 2.16, the 
MRE is 0.273, and the RMRSR is 0.358 indicating that the value of d can be accurately estimated. 


(0, Af) = (100,1000) (100,2000) (500,1000) (500,2000) 


(a) homogeneous sampling of reads 


d 

2 

3 

6 

11 

d 

2.05 

3.08 

6.16 

11.35 

MRE 

0.278 

0.283 

0.281 

0.279 

RMRSR 

0.328 

0.328 

0.339 

0.338 

(b) inhomogeneous sampling of reads 

d 

2.80 

4.60 

9.56 

18.15 

d 

2.85 

4.92 

10.03 

18.86 

MRE 

0.283 

0.281 

0.281 

0.292 

RMRSR 

0.321 

0.348 

0.353 

0.359 


Table S3: The estimation of the effective coverage d for a) the homogeneous and b) inhomogeneous 
sampling of reads with G = 10 5 bps, j3 = 100,500 bps, and M = 1000,2000; d = 

T=2000. 
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(a) a first order MC, c = 1, homogeneous 


(b) a first order MC, G = 10°, homogeneous 



(c) a second order MC, c = 1, homogeneous 



Coverage 

(d) a second order MC, G = 10 5 , homogeneous 


Figure S5: The precision rates of the AIC and BIC based estimators for the order of MC under a first 
and a second order MC, with homogeneous sampling and sequencing error rate 10%. (a,b): The effects 
of genome length and read coverage to the precision rates under a first order MC. (c,d): The effects of 
genome length and read coverage on the precision rates under a second order MC. 
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(a) a first order MC, c = 1, inhomogeneous 



Genome Length 


(c) a second order MC, c = 1, inhomogeneous 



(b) a first order MC, G = 10 s , inhomogeneous 



Coverage 

(d) a second order MC, G = 10 s , inhomogeneous 


Figure S6: The precision rates of the AIC and BIC based estimators for the order of MC under a first and 
a second order MC, with inhomogeneous sampling and sequencing error rate 10%. (a,b): The effects 
of genome length and read coverage to the precision rates under a first order MC. (c,d): The effects of 
genome length and read coverage on the precision rates under a second order MC. 
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C.2 The relationship among 28 vertebrate species 

Table |S4|shows the estimated orders of MC for the 28 genomes of vertebrate species based on their NGS 
samples. For each of the 28 species, we compute the fraction of the k-tuple words that have (Z^) 2 /d 
within the 99% of a x 2 distribution with one degree of freedom, for k = 8,9,..., 14. Using 80% as 
a threshold, we estimate the orders of MC for each species to be the smallest k — 2 under which the 
fraction of words that can be explained by the (k — 2)-th order MC is greater than the threshold. The 
last two columns of Table [S4| are the AlC-predicted optimal and BIC-predicted optimal orders obtained 
Narlikar et al.| ( j2013| . 

using d 2 and df under MC 


in 


We cluster the NGS datasets of the 28 species listed in the Table 


S4 


models of varying order. Figure SI in Miller et ak| (20071 gives the phylogenetic tree of the 28 species 
based on alignment methods, with branch lengths noted on it. We obtain the distance matrix of the 28 
species by computing pairwise disstance between each pair of the 28 species from the phylogenetic tree; 
the matrix is given as Table [S5j We compare the dissimilarity matrices using d 2 and df under MC models 
with different order to the matrix Table [S5] which we use as underlying truth. 


C.3 The relationship among 13 tropical tree species with unknown reference genomes 

We also apply our method to the 13 tree species (8 Fagaceae and 5 Moraceae) based on the NGS shotgun 
read data sets in Cannon et al.j ( 2010j ). The reference genome sequences for the 13 tree species are 
unknown. Figure 1 in the main text shows the clustering results using d 2 under MCs of order 0, 4, 8 and 
9. For the clustering results using df, see Figure [sv] 
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Table S4: Estimating the order of the MC for 28 species based on NGS read samples. The ’genome’ column is the scientific names shown in 
UCSC; ’estimated order’ is the minimum order of MC r that can explain greater than 80% of the (r + 2)-tuple word at type I error 0.01; ’APO’ and 
’BPO’ columns show the AIC-predicted optimal and BIC-predicted optimal orders obtained in |Narlikar et al. ( 2013) ; the middle columns with order 
k = 8, 9,..., 14 are the fraction of words w with length k that have (Z^) 2 /d within the 99% of a y 2 distribution with one degree of freedom, ’na’ 
means the numbers are not provided in|Narlikar et al. (2013 . 
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fable S5: 

The pairwise distance matrix obtain from the Figure S1 

in 

Miller et al. 

2007). Figure SI in 

Miller et al. 

2007 

shows the phylogenetic tree 


of the 28 species based on alignment methods with branch lengths noted on it. 
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Figure S7: The clustering of the 13 tropical tree species using df under MC with order of O(i.i.d), 4, 8 and 
9. The number on the branch refers to the frequency of the branch occurring among the 30 clusterings 
based on random sampled 10% reads. The letter ‘F’ at the beginning of the names represents Fagcicecie\ 
similarly the letter ‘M’ represents Maraceae. 
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